A molecular dynamics calculation of the amino acid polar requirement is presented and used 
to score the canonical genetic code. Monte Carlo simulation shows that this computational polar 
requirement has been optimized by the canonical genetic code more than any previously-known 
measure. These results strongly support the idea that the genetic code evolved from a communal 
state of life prior to the root of the modern ribosomal tree of life. 
The genetic code is one of life's most ancient and universal
features [l|,0]. It summarizes how RNA transcripts
are translated into amino acids to form proteins, and is 
shared by all known cells, across the three domains of life, 
with only a few very minor variations0,[3|. Representing 
a complex series of biochemical steps that comprise the
translation apparatus of every known cell, the canonical genetic
code is a mapping from the space of nucleotide triplets
(codons) to the space of amino acids. Sequences of codons
then correspond to protein sequences, and ultimately give 
rise to each organism's phenotype. 

Almost immediately after its elucidation, attempts 
were made to explain the assignment of codons to amino 
acids. It was noticed that amino acids with related prop- 
erties were grouped together, which would have the ef- 
fect of minimizing translation errors [H @, 0]. In order 
to determine whether or not this was a genuine corre- 
lation or simply a fluctuation reflecting the limited size 
of the codon table, the canonical genetic code was com- 
pared to samples of randomly-generated synthetic codes, 
starting with early but inconclusive Monte Carlo work 
of Alff-SteinbergerQ, and compellingly revisited with 
larger sample sizes by Haig and Hurst [9]. Depending on
the measure used to characterize or score the sampled 
codes, high degrees of optimality have been reported. 
For example, using an empirical measure of amino acid 
differences referred to below as the "experimental polar
requirement" (EPR) [13, 11 1, Freeland and Hurst calculated that the
genetic code is "One in a million" @> O) [12]. More recently, it has
been shown that when coupled to known patterns of codon usage, the
canonical code (and the codon usage) is simultaneously optimized with
respect to point mutations and to the rapid termination of peptides
that are generated with frameshift errors.

These results are generally interpreted to imply that the canonical
genetic code had to have undergone a period of evolution, and was not
simply a frozen accident fljl Il6j |. While it was long assumed that
code evolution would be lethal, it has recently been shown how a
genetic code can evolve along with a dynamic refinement of the
precision of translation [17, 18]. Under vertically dominated
evolutionary dynamics, the optimization of the code is relatively
weak: the system gets trapped in "local minima" and is neither
strongly optimized nor converged to a unique code. On the other hand,
if the evolutionary dynamics is horizontally dominated, with genes
shared between organisms (as is the case with contemporary microbes
(lijh), modularity of structures such as the translation apparatus and
the genome emerges naturally (2fjj|, and optimization is strong, rapid
and convergent to a universal genetic code, suggesting that early life
was communal in nature and collective in its dynamics. Thus, the
extent of optimality of the canonical genetic code is of great
interest, because the greater the level of optimization, the more
likely it is that the genetic code evolved when life was communal in
character.

The purpose of this Letter is to set a lower bound on 
the level of optimality of the canonical genetic code. We 
do this by using molecular dynamics to construct a mea- 
sure of code optimality without any input from experi- 
ment, specifically by simulating the equilibrium behavior 
of free amino acids in water-pyridine solutions, resem- 
bling those of the original polar requirement experiments, 
and constructing a "computational polar requirement" 
(CPR) by analysis of certain two-point correlation func- 
tions. We then use Monte Carlo simulation to determine 
the level of code optimality, and find that the level is so 
high that a new and detailed error analysis is required to 
ensure statistically significant assessment of very small 
probabilities. We also explore the dependence of our results on the
sensitivity of code optimality to the scale of code variations. Our
results lend strong support to evolutionary scenarios for the
structure of the genetic code, with a level of optimization that would
only be attainable from some form of collective dynamics. We also
report indications that the dynamics involved the refinement over
evolutionary time of a primitive translation machinery that was
ambiguous, generating a statistical ensemble of related proteins
("statistical proteins") 0, EJ rather than a unique protein, as is now
the case.

Molecular Dynamics of the Polar Requirement:- The experimental polar
requirement is a chromatographic measure of amino acid affinity to a
water-pyridine solution that was originally motivated by a simple
stereochemical theory of the origin of the genetic code 0, E3, 11, H2|.
This measure is related to, and strongly correlated with, several
other amino acid measures, such as hydrophobicity and Grantham
polarity [23], but is not simply related to these other measures. The
original polar requirement experiments used partition chromatography.
In the EPR experiments, water/dimethylpyridine (DMP) mixtures ranging
from 40% to 80% mole fraction water were used for each amino acid
measured. When the chromatographic factor $R_m$ was plotted as a
function of mole fraction water on a log-log scale, a linear trend was
observed for each amino acid. The slope of the corresponding best-fit
line was taken to be the amino acid's EPR.
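The slope extraction described above can be illustrated with a short sketch. It assumes only what the text states, namely that the EPR is the slope of a straight-line fit on a log-log plot; the function name `loglog_slope` and the data are hypothetical, and sign conventions and the exact definition of the chromatographic factor are not addressed.

```python
import math

def loglog_slope(mole_fraction_water, r_m):
    """Least-squares slope of log(r_m) against log(mole fraction water)."""
    xs = [math.log(x) for x in mole_fraction_water]
    ys = [math.log(y) for y in r_m]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical data: an exact power law r_m = 2 * x**5, so the fitted
# log-log slope recovers the exponent 5.
water = [0.4, 0.5, 0.6, 0.7, 0.8]
r_m = [2.0 * x ** 5.0 for x in water]
slope = loglog_slope(water, r_m)
```

On measured data the points scatter about the line, and the fitted slope plays the role of the polar requirement for that amino acid.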

The methods used for obtaining the computational polar requirement
(CPR) values have been reported elsewhere [24] and are summarized
here. The distribution of solute molecules across the water/DMP
interface is related to the equilibrium solvent environment
surrounding the molecules in a binary solution similar to that used in
the experiments. Trends in the local water density of a solvated amino
acid in water/DMP solutions were found to be linear functions of mole
fraction water. The slopes of these linear trends were used to obtain
the set of CPR values. To quantitatively measure the differences in
local solvent environment, molecular dynamics (MD) calculations were
performed using NAMD2 with an NPT ensemble [25] and the CHARMM27 force
field [26, 27]. A pressure of 1.01325 bar and a temperature of 300 K
were maintained for each simulation. The systems consisted of a single
amino acid molecule in a box of water and randomly placed DMP
molecules at a specified water/DMP ratio. For each amino acid at least
four systems, each with a different water/DMP ratio, were simulated.
Radial distribution functions (RDFs) of water relative to the amino
acid side chains were calculated from the equilibrated MD trajectories
using VMD (Visual Molecular Dynamics) [28]. The RDFs were calculated
by a time average over the equilibrated portion of a trajectory
[iij ].
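For readers unfamiliar with RDFs, the following is a minimal sketch of a time-averaged radial distribution function around a single reference atom. It is not the VMD implementation used for the results reported here. It assumes that the distances from the reference atom to the selected water atoms have already been extracted for each frame, and it uses a fixed box volume, which is a simplification for an NPT run. The function name and arguments are hypothetical.

```python
import math

def radial_distribution(frames, box_volume, r_max, n_bins):
    """Time-averaged g(r) of selected atoms around one reference atom.

    frames: list of frames, each a list of reference-to-atom distances.
    box_volume: simulation box volume, assumed constant (a simplification).
    """
    dr = r_max / n_bins
    counts = [0] * n_bins
    n_selected = 0
    for distances in frames:
        n_selected += len(distances)
        for d in distances:
            if 0.0 <= d < r_max:
                counts[min(int(d / dr), n_bins - 1)] += 1
    n_frames = len(frames)
    # Mean number density of the selected atoms in the box.
    bulk_density = n_selected / n_frames / box_volume
    centers, g = [], []
    for k in range(n_bins):
        r_in, r_out = k * dr, (k + 1) * dr
        shell_volume = 4.0 / 3.0 * math.pi * (r_out ** 3 - r_in ** 3)
        centers.append(0.5 * (r_in + r_out))
        # Average count per frame in the shell, relative to the bulk expectation.
        g.append(counts[k] / n_frames / (shell_volume * bulk_density))
    return centers, g
```

Values of g(r) above one indicate a local density of water higher than the bulk average at that distance from the reference atom.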

The most distant atom of the amino acid side chain was used as a
reference atom, and the oxygen or hydrogens (as appropriate) from the
water molecules were used as a selection in calculating the RDFs.
Calculated in this manner, the maximum value of the first peak in an
RDF is related to the relative density of water in the first solvation
layer of the amino acid side chain. It was found that these maxima
varied linearly with water/DMP ratios for each amino acid, and that
the slopes of the corresponding lines were strongly correlated with
the EPR ($R^2 = 0.92$) (Fig. 1). We confirmed that tyrosine's large
deviation from the experimental value was not due to a weak signal in
the RDF.
FIG. 1: Scatter plot showing the relationship between RDF 
peak slope and experimental polar requirement for all amino 
acids. The straight line is a guide to the eye. 
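The step from RDFs to a CPR value can be sketched as follows: take the height of the first peak of each RDF and fit a straight line against the water content of the corresponding system. This is an illustration under stated assumptions, not the analysis code behind Fig. 1. It assumes a smooth, already averaged RDF, so that the first interior local maximum is the first solvation peak. The function names and the RDF arrays are hypothetical.

```python
def first_peak_height(g):
    """Height of the first interior local maximum of a sampled RDF."""
    for k in range(1, len(g) - 1):
        if g[k] > g[k - 1] and g[k] >= g[k + 1]:
            return g[k]
    raise ValueError("no interior local maximum found")

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical RDFs for four systems whose first-peak height grows
# linearly with mole fraction water as 1 + 2 * x.
water_fraction = [0.4, 0.5, 0.6, 0.7]
rdfs = [[0.0, 0.5, 1.0 + 2.0 * x, 0.8, 1.1, 1.0] for x in water_fraction]
peaks = [first_peak_height(g) for g in rdfs]
cpr_like = fit_slope(water_fraction, peaks)
```

With noisy RDFs a smoothing step or a restricted search window would be needed before locating the first peak; that choice is left open here.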



Optimality analysis of the canonical genetic code:- To an- 
alyze the CPR, we used the point mutation code analysis 
algorithms described in § and along with an ana- 
lytical realization of bootstrap error analysis to assess 
the statistical significance of the results. The algorithms 
treat the genetic code as a mapping $GC^i$ : Codons → Amino Acids,
where $i$ indexes a particular set of assignments of codons to amino
acids, with $GC^1$ as the canonical code. Codons is the set of codons
excepting the termination codons, and Amino Acids is the set of amino
acids, e.g. $GC^1(\mathrm{UUU}) = \mathrm{Phe}$. New versions
$GC^{i \neq 1}$ of the mapping are generated by randomly permuting
amino acid labels, leaving termination codons fixed. This preserves
the degeneracy structure of the genetic code. The optimality of a
given realization of the genetic code $GC^i$ is assessed by evaluating
the sum

$$\sum_{\langle c,c' \rangle \neq \mathrm{Ter}} W_{c,c'} \, d_q\left(GC^i(c),\, GC^i(c')\right) \qquad (1)$$



where $\langle c,c' \rangle \neq \mathrm{Ter}$ denotes a sum over
nearest neighbor codons, with the nearest neighbors of a codon defined
by its single point mutations and all mutations to or from a
termination codon excluded. The matrix $W_{c,c'}$ weights transition
and transversion biases differently for different positions in the
codon, according to a toy model of typical transversion/transition
biases in real translation. In our calculations, we used the values
from [12], as listed in Table I. Finally, $d_q(x, y)$ is a metric on
the space of amino acids. For the polar requirement, the metric is
taken to be $d_q(x, y) = |x - y|^q$ over the polar requirement values
corresponding to the given amino acids.

TABLE I: The matrix $W_{c,c'}$ of transition/transversion biases,
taken from [12].

                 First base   Second base   Third base
Transitions      1            0.5           1
Transversions    0.5          0.1           1
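The sum in Eq. (1) can be made concrete with a short sketch. It assumes the standard codon table, the Table I weights, and the metric $|x - y|^q$. It counts each neighboring codon pair in both directions, which changes the total only by a constant factor. The names `CANONICAL`, `WEIGHTS` and `code_cost` are illustrative, and the amino acid property values must be supplied by the reader; the tryptophan indicator used below is a hypothetical toy property, not polar requirement data.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order (first base varies slowest); '*' is a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CANONICAL = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

TRANSITIONS = {("U", "C"), ("C", "U"), ("A", "G"), ("G", "A")}
# Table I weights as (transition, transversion) for codon positions 1, 2, 3.
WEIGHTS = [(1.0, 0.5), (0.5, 0.1), (1.0, 1.0)]

def code_cost(code, prop, q=2.0):
    """Weighted sum of |prop difference|**q over single-point-mutation pairs.

    Pairs involving a termination codon are skipped. Each unordered pair is
    visited twice (once from each side), a constant factor of two.
    """
    total = 0.0
    for codon, aa in code.items():
        if aa == "*":
            continue
        for pos in range(3):
            w_transition, w_transversion = WEIGHTS[pos]
            for base in BASES:
                if base == codon[pos]:
                    continue
                neighbor = codon[:pos] + base + codon[pos + 1:]
                aa2 = code[neighbor]
                if aa2 == "*":
                    continue
                is_transition = (codon[pos], base) in TRANSITIONS
                w = w_transition if is_transition else w_transversion
                total += w * abs(prop[aa] - prop[aa2]) ** q
    return total

# Hypothetical toy property: 1 for tryptophan (W), 0 for all other amino acids.
toy_prop = {aa: 0.0 for aa in set(AMINO) if aa != "*"}
toy_prop["W"] = 1.0
toy_cost = code_cost(CANONICAL, toy_prop)
```

In this convention a smaller total means that point mutations change the chosen property less on average, so a lower value corresponds to a more optimal code.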

The appropriate quantity to compute is the probability
$P_b = \Pr(O > O_1)$ that a random realization, with optimality $O$,
is more optimal than the canonical code, with optimality $O_1$. To
compute $P_b$, we count the number of randomly generated codes that
are more optimal than the canonical code and divide by the total
number of random codes generated. $P_b$ is invariant under uniform
linear rescaling of the amino acid polar requirement data, and is
smaller for more optimal codes. Unlike the bare optimality score,
which provides only a simple linear scale, $P_b$ also accounts for the
large number of codes that can be explored.
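The counting procedure above can be sketched as a small Monte Carlo loop. This is an illustration, not the code used for the results in this Letter. It assumes that a code is a dictionary from codons to amino acid labels with `"*"` marking termination, that the caller supplies a cost function for which lower values are more optimal, and that "more optimal" means strictly lower cost. The names `shuffle_labels` and `estimate_pb`, and the toy code, are hypothetical.

```python
import random

def shuffle_labels(code, rng):
    """Permute amino acid labels among synonymous blocks; stops stay fixed."""
    labels = sorted(set(code.values()) - {"*"})
    shuffled = labels[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(labels, shuffled))
    relabel["*"] = "*"
    return {codon: relabel[aa] for codon, aa in code.items()}

def estimate_pb(code, cost, n_samples, seed=0):
    """Count label-permuted codes with strictly lower cost than `code`."""
    rng = random.Random(seed)
    reference = cost(code)
    n_better = sum(
        cost(shuffle_labels(code, rng)) < reference for _ in range(n_samples)
    )
    return n_better, n_better / n_samples

# Hypothetical toy example: three "codons" on a line plus one stop.
toy_code = {0: "A", 1: "B", 2: "C", 3: "*"}
toy_prop = {"A": 0.0, "B": 1.0, "C": 2.0}

def toy_cost(code):
    # Sum of property differences between adjacent non-stop positions.
    return sum(abs(toy_prop[code[i]] - toy_prop[code[i + 1]]) for i in range(2))

# The ordered toy code already has the minimum cost, so no permutation beats it.
n_better, pb = estimate_pb(toy_code, toy_cost, n_samples=1000)
```

Because every amino acid keeps its whole block of codons, the shuffle preserves the degeneracy structure of the original code, as required in the text.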

The error in the computed $P_b$ can be estimated using an analytical
realization of bootstrap resampling. Simulated data sets for the
bootstrap are created by randomly sampling optimality scores from the
original data set. When the samples are drawn from the original set,
there are only two alternatives: a code more optimal or less optimal
than the canonical code can be sampled, with probability
$P_b = N_{O>O_1}/N_{\mathrm{total}}$ of drawing a random code better
than the canonical code. Since the number of better codes in a sample
is the number whose error we wish to estimate, we can regard drawing a
better code as a step to the right with probability $P_b$ in a
one-dimensional random walk. The known formulas for the asymmetric
one-dimensional random walk allow us to immediately write down the
bootstrap error estimate in the limit of infinitely many resampled
sets, i.e. the exact bootstrap estimate. For metrics under which the
code is fairly optimal (i.e. $P_b \ll 1$), we obtain the variance in
$P_b$ to be



$$\mathrm{var}[P_b] = \frac{P_b (1 - P_b)}{N_{\mathrm{total}}} \approx \frac{N_{O>O_1}}{N_{\mathrm{total}}^2}. \qquad (2)$$



To obtain a reasonable estimate of error, or to compare the results
of different metrics on the space of amino acids, the number of more
optimal codes, $N_{O>O_1}$, from the random sample must be
sufficiently large ($\sqrt{N_{O>O_1}} < N_{O>O_1}$, or about
$N_{O>O_1} = 10$ as a reasonable minimum).
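As a numerical illustration of the variance formula, the sketch below evaluates the binomial (asymmetric random walk) standard deviation of $P_b$ from a count of better codes. The function name and the input counts are hypothetical; 19 better codes in $10^8$ samples is chosen only because it gives an uncertainty of about $4.36 \times 10^{-8}$.

```python
import math

def pb_with_error(n_better, n_total):
    """Estimate P_b and its standard deviation from binomial counting."""
    pb = n_better / n_total
    variance = pb * (1.0 - pb) / n_total
    return pb, math.sqrt(variance)

# Hypothetical counts: 19 better codes found among 10**8 random codes.
pb, err = pb_with_error(19, 10**8)
```

For small $P_b$ the standard deviation is close to $\sqrt{N_{O>O_1}}/N_{\mathrm{total}}$, so the relative error falls as the inverse square root of the number of better codes found.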

When the computational polar requirement difference squared is used
in the amino acid metric, $P_b = (19 \pm 4.36) \times 10^{-8}$. In
contrast, with the experimental polar requirement,
$P_b = (26.5 \pm 1.63) \times 10^{-7}$, so the CPR gives an order of
magnitude improvement. To assess the impact of tyrosine (which had the
largest variation between the CPR and EPR values) on these results, we
redid the calculation of $P_b$ for the CPR, but with the tyrosine
value replaced by the value from the EPR. The result is
$P_b = (9.3 \pm 1.0) \times 10^{-7}$. To test the sensitivity of the
results for the CPR, we varied each element of $W_{c,c'}$
independently by $\pm 0.1 \times W_{c,c'}$ and repeated the
calculation of $P_b$. This led to results that were statistically
indistinguishable from the results reported above. Shorter
computations (justified by the faster convergence due to decreased
optimality) for the EPR indicate a similar level of robustness. With a
$W_{c,c'}$
uniform among nearest neighbors we saw substantial in- 
creases in Pb in agreement with However, the CPR 
continued to be superior to the EPR, with the CPR yielding
$P_b = (3.7 \pm 0.61) \times 10^{-5}$ and the EPR yielding
$P_b = (11.8 \pm 1.1) \times 10^{-5}$.

Varying the value of q in the metric (30l | provides a 
further probe to explore the optimization of the genetic code.
Increasing the value of $q$ is equivalent to emphasizing the role of
larger and larger differences between the amino acid intended and the
one generated by point mutation. Thus, if $P_b$ decreases for
increased values of $q$, the code (along with $W_{c,c'}$) evolved to
suppress the effects of rarer, possibly catastrophic errors that may
be generated by point mutations. This may happen primarily by evolving
small elements of $W_{c,c'}$ where $c \to c'$ is catastrophic, or vice
versa. Conversely, if $P_b$ decreases for smaller values of $q$, the
code evolved both to mitigate the possibility of these catastrophic
errors and to minimize the effects of frequent, small errors. Varying
$q$, we find that the canonical genetic code is most optimal for $q$
between one and two, with significant increases in $P_b$ outside this
regime in either direction (Fig. 2). This indicates that the genetic
code is optimized for minimizing errors according to their size, with
no undue emphasis on larger or smaller errors. Given the relative
weakness of the code when emphasizing large errors, evolution must
have favored organisms that discarded or edited fatally flawed
proteins over evolving the code to make them less likely at the cost
of reducing its ability to minimize the more frequent moderate and
minor errors. The weakness of the canonical code when minor errors are
emphasized ($q < 1$) suggests that while the code was still evolving,
minor errors were on the whole less important biologically, as would
be expected in evolutionary dynamics [17, 18] that utilized ambiguity
tolerance in early proteins [7, 21].
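The role of the exponent $q$ can be seen in a small numerical sketch. It assumes the metric $d_q(x, y) = |x - y|^q$ and asks what fraction of a total cost comes from the single largest property difference. The function names and the list of differences are hypothetical.

```python
def d_q(x, y, q):
    """Amino acid distance |x - y|**q."""
    return abs(x - y) ** q

def large_error_share(differences, q):
    """Fraction of the summed d_q cost due to the largest single difference."""
    costs = [d_q(d, 0.0, q) for d in differences]
    return max(costs) / sum(costs)

# Hypothetical property differences: four small errors and one large error.
diffs = [0.5, 0.5, 0.5, 0.5, 4.0]
shares = {q: large_error_share(diffs, q) for q in (0.5, 1.0, 2.0, 4.0)}
```

In this toy case the share of the large error is 2/3 at $q = 1$ and 16/17 at $q = 2$, and it drops below one half at $q = 0.5$. Raising $q$ therefore makes the score increasingly sensitive to rare large errors, while lowering it shifts the emphasis toward frequent small ones.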

Optimality analysis of alternative codes and measures:- A selection of
variant codes was also analyzed using the CPR. Our findings, displayed
in Table II, were consistent with the previous findings of Knight in
that the alternative codes did not show marked improvements in
optimality over the canonical code [13]. This is consistent with our
expectation that evolutionary pressure to optimize the code with
respect to the polar requirement was eased after the last universal
ancestral state.

FIG. 2: $P_b$ as a function of the exponent $q$ in the amino acid
metric.



TABLE II: $P_b$ for several naturally occurring variant codes.

Code                        $P_b$
Canonical                   $(19 \pm 4.36) \times 10^{-8}$
Yeast Mitochondrial         $(11 \pm 3.32) \times 10^{-8}$
CDH Nuclear Code            $(21 \pm 4.58) \times 10^{-8}$
Ascidian Mitochondrial      $(583 \pm 24.15) \times 10^{-8}$
Echinoderm Mitochondrial    $(51 \pm 7.14) \times 10^{-8}$



We also tested Grantham polarity [23], which has been argued, in a
survey of genetic code optimality under different amino acid measures,
to be the amino acid measure most optimized by the genetic code [13].
The results yield $P_b = (285 \pm 16.88) \times 10^{-8}$, an order of
magnitude higher than with the CPR metric, leading to the conclusion
that the CPR is the most effective known metric for optimization of
the genetic code. Previous computations evaluated $P_b$ by generating
100,000 random codes [13]. Scaling our results to the size of these
original simulations, we see that the EPR and the Grantham polarity
have virtually identical scores. Scaling the errors for the CPR and
the Grantham polarity to errors assessed from only 100,000 codes, we
get for the CPR $P_b = (0.19 \pm 0.44) \times 10^{-5}$ and for the
Grantham polarity $P_b = (2.85 \pm 1.69) \times 10^{-5}$. These results
are within a standard deviation and a half of each other, and are
therefore not different in a statistically meaningful way.

In conclusion, earlier estimates of code optimality were 
understated by a statistically significant amount. The 
extent of optimality revealed here further supports the 
notion that the genetic code must have evolved during 
an early communal state of life[l^|. 

We gratefully acknowledge discussions with Carl 
Woese and thank the referees for helpful suggestions that 